By the end of the lecture, you will be able to …
Download and open code-along-04.qmd
Load the standard packages.
https://gss.norc.org/get-the-data/stata
Heads Up!
Save the unzipped file in your class data folder.
# Use here() to construct the file path
gss_panel.dta <- here("data", "GSS_2020_panel_stata_1a/gss2020panel_r1a.dta")
#load the data using `haven::read_dta()`
data <- read_dta(gss_panel.dta)
# Or, do both at the same time!
# data <- read_dta(here("data", "GSS_2020_panel_stata_1a/gss2020panel_r1a.dta"))Study of former 2016 and 2018 GSS respondents were interviewed again in 2020
# A tibble: 10 × 7
yearid year_1a year_1b year_2 age_1a age_1b age_2
<dbl> <dbl> <dbl> <dbl> <dbl+lbl> <dbl+lbl> <dbl+lbl>
1 20182183 NA 2018 NA NA(i) 52 NA(i)
2 20180711 NA 2018 NA NA(i) 19 NA(i)
3 20182189 NA 2018 2020 NA(i) 37 39
4 20160354 2016 NA NA 56 NA(i) NA(i)
5 20180452 NA 2018 2020 NA(i) 29 31
6 20181503 NA 2018 2020 NA(i) 58 60
7 20162744 2016 NA 2020 71 NA(i) 75
8 20160315 2016 NA 2020 69 NA(i) 73
9 20160170 2016 NA NA 75 NA(i) NA(i)
10 20161888 2016 NA NA 71 NA(i) NA(i)
Match variables according to a given pattern.
starts_with(): Starts with an exact prefix.ends_with(): Ends with an exact suffix.contains(): Contains a literal string.head() & tail()Look at the first few column names and first few rows.
# A tibble: 5 × 14
yearid wtssnr_2 age_1a age_1b age_2 family16_1a family16_1b family16_2
<dbl> <dbl> <fct> <fct> <fct> <fct> <fct> <fct>
1 20160001 1.44 47 <NA> 51 both own mother … <NA> not avail…
2 20160002 0.722 61 <NA> 65 both own mother … <NA> not avail…
3 20160003 NA 72 <NA> <NA> both own mother … <NA> iap
4 20160004 2.89 43 <NA> 47 mother only <NA> not avail…
5 20160005 NA 55 <NA> <NA> both own mother … <NA> iap
# ℹ 6 more variables: socfrend_1a <fct>, socfrend_1b <fct>, socfrend_2 <fct>,
# childs_1a <fct>, childs_1b <fct>, childs_2 <fct>
Look at the first few column names and last few rows.
# A tibble: 5 × 14
yearid wtssnr_2 age_1a age_1b age_2 family16_1a family16_1b family16_2
<dbl> <dbl> <fct> <fct> <fct> <fct> <fct> <fct>
1 20182344 NA <NA> 37 <NA> <NA> mother and stepf… iap
2 20182345 0.995 <NA> 75 77 <NA> both own mother … not avail…
3 20182346 0.995 <NA> 67 70 <NA> both own mother … not avail…
4 20182347 NA <NA> 72 <NA> <NA> both own mother … iap
5 20182348 NA <NA> 79 <NA> <NA> both own mother … iap
# ℹ 6 more variables: socfrend_1a <fct>, socfrend_1b <fct>, socfrend_2 <fct>,
# childs_1a <fct>, childs_1b <fct>, childs_2 <fct>
This data is NOT tidy!
Some column names include values of a variable (survey year).
# A tibble: 15,645 × 7
yearid wtssnr_2 panel age family16 socfrend childs
<dbl> <dbl> <chr> <fct> <fct> <fct> <fct>
1 20160001 1.44 1a 47 both own mother and father several tim… 3
2 20160001 1.44 1b <NA> <NA> <NA> <NA>
3 20160001 1.44 2 51 not available for this year several tim… 3
4 20160002 0.722 1a 61 both own mother and father several tim… 0
5 20160002 0.722 1b <NA> <NA> <NA> <NA>
6 20160002 0.722 2 65 not available for this year about once … 0
7 20160003 NA 1a 72 both own mother and father <NA> 2
8 20160003 NA 1b <NA> <NA> <NA> <NA>
9 20160003 NA 2 <NA> iap <NA> <NA>
10 20160004 2.89 1a 43 mother only once or twi… 4
# ℹ 15,635 more rows
This data is tidy!
Each variable in its own column, and each observation in its own row.
pivot_longer()my_data_long <- my_data |>
pivot_longer(
cols = 3:14,
names_to = "variable",
values_to = "value")
head(my_data_long, n = 5)cols = specifies which columns you want to turn into one
names_to = defines the name of the new variable containing the current variable names
values_to= defines the name of the new variable that takes in the values of the variables
# A tibble: 5 × 4
yearid wtssnr_2 variable value
<dbl> <dbl> <chr> <fct>
1 20160001 1.44 age_1a 47
2 20160001 1.44 age_1b <NA>
3 20160001 1.44 age_2 51
4 20160001 1.44 family16_1a both own mother and father
5 20160001 1.44 family16_1b <NA>
This is also not tidy data!
The variable column contains the variable names & the panel ID.
separate()my_data_long <- my_data |>
pivot_longer(
cols = c(-yearid, -wtssnr_2),
names_to = "variable",
values_to = "value") |>
separate_wider_delim(variable,
delim = "_",
names = c("variable", "panel"))
head(my_data_long, n = 5)variable into two using a delimiter
names specifies the two new column names
# A tibble: 5 × 5
yearid wtssnr_2 variable panel value
<dbl> <dbl> <chr> <chr> <fct>
1 20160001 1.44 age 1a 47
2 20160001 1.44 age 1b <NA>
3 20160001 1.44 age 2 51
4 20160001 1.44 family16 1a both own mother and father
5 20160001 1.44 family16 1b <NA>
This is still not tidy data!
The value variable is a mix of different types of values
pivot_wider()my_data <- my_data |> # overwriting my_data
pivot_longer(
cols = c(-yearid, -wtssnr_2),
names_to = "variable",
values_to = "value") |>
separate_wider_delim(variable,
delim = "_",
names = c("variable", "panel")) |>
pivot_wider(
names_from = variable,
values_from = value)
head(my_data, n = 5)# A tibble: 5 × 7
yearid wtssnr_2 panel age family16 socfrend childs
<dbl> <dbl> <chr> <fct> <fct> <fct> <fct>
1 20160001 1.44 1a 47 both own mother and father several time… 3
2 20160001 1.44 1b <NA> <NA> <NA> <NA>
3 20160001 1.44 2 51 not available for this year several time… 3
4 20160002 0.722 1a 61 both own mother and father several time… 0
5 20160002 0.722 1b <NA> <NA> <NA> <NA>
my_data <- my_data |>
mutate(panel = case_when(
panel == "1a" ~ 2016,
panel == "1b" ~ 2018,
panel == "2" ~ 2020,
TRUE ~ NA_integer_))
head(my_data, n = 3)# A tibble: 3 × 7
yearid wtssnr_2 panel age family16 socfrend childs
<dbl> <dbl> <dbl> <fct> <fct> <fct> <fct>
1 20160001 1.44 2016 47 both own mother and father several time… 3
2 20160001 1.44 2018 <NA> <NA> <NA> <NA>
3 20160001 1.44 2020 51 not available for this year several time… 3
Heads Up!
family16 is a time-invariant variable.
relocate()# A tibble: 2 × 7
panel yearid wtssnr_2 age family16 socfrend childs
<dbl> <dbl> <dbl> <fct> <fct> <fct> <fct>
1 2016 20160001 1.44 47 both own mother and father several times… 3
2 2018 20160001 1.44 <NA> <NA> <NA> <NA>
arrange()# A tibble: 15,645 × 4
yearid panel age family16
<dbl> <dbl> <fct> <fct>
1 20160001 2016 47 both own mother and father
2 20160002 2016 61 both own mother and father
3 20160003 2016 72 both own mother and father
4 20160004 2016 43 mother only
5 20160005 2016 55 both own mother and father
6 20160006 2016 53 other
7 20160007 2016 50 both own mother and father
8 20160008 2016 23 both own mother and father
9 20160009 2016 45 both own mother and father
10 20160010 2016 71 both own mother and father
# ℹ 15,635 more rows
# A tibble: 15,645 × 4
yearid panel age family16
<dbl> <dbl> <fct> <fct>
1 20160001 2020 51 not available for this year
2 20160002 2020 65 not available for this year
3 20160003 2020 <NA> iap
4 20160004 2020 47 not available for this year
5 20160005 2020 <NA> iap
6 20160006 2020 <NA> iap
7 20160007 2020 <NA> iap
8 20160008 2020 27 not available for this year
9 20160009 2020 49 not available for this year
10 20160010 2020 <NA> iap
# ℹ 15,635 more rows
APPEND
add new observations (rows) to existing variables
MERGE
add new variables (columns) to existing observations (many merge types)
dataframe 1
coupleid name age
1 2 John 42
2 1 Megan 36
3 3 Bin 38
dataframe 2
coupleid name age
1 1 Sue 40
2 3 Ye-jin 39
3 2 Chrissy 35
dataframe 3
coupleid marstat numchild country
1 3 1 1 S.Korea
2 1 0 0 US
3 2 1 4 US
bind_rows()add columns from df1 to df2, matching observations based on the keys
left_join() keeps all observations in df1.right_join() keeps all observations in df2.full_join() keeps all observations in df1 and df2.inner_join() only keeps observations from df1 that have a matching key in df2left_join()Are married people above or below average in internet use or income? Does it vary by survey year?
How do we find out?
Answering this research question takes a few steps. But, the first step is to create a dataframe with all the necessary information.
Can you reproduce this table?
# A tibble: 15,645 × 7
yearid panel marital wwwhr realrinc avg_www avg_inc
<dbl> <dbl> <fct> <dbl+lbl> <dbl+lbl> <dbl> <dbl>
1 20160001 2016 married 15 164382. 14.2 24929.
2 20160001 2018 <NA> NA(i) NA(i) 14.8 26103.
3 20160001 2020 married 20 147659. 15.0 28222.
4 20160002 2016 never married 5 [never married] 25740 14.2 24929.
5 20160002 2018 <NA> NA(i) NA(i) 14.8 26103.
6 20160002 2020 never married 10 23980 15.0 28222.
7 20160003 2016 married NA(i) NA(i) 14.2 24929.
8 20160003 2018 <NA> NA(i) NA(i) 14.8 26103.
9 20160003 2020 <NA> NA(i) NA(i) 15.0 28222.
10 20160004 2016 married 7 5265 14.2 24929.
# ℹ 15,635 more rows
What’s your conclusion to our research question?
# Income status
ctable(think_married$income_status, think_married$panel,
prop = "c",
format = "p",
useNA = "no",
headings = FALSE
)
--------------- ------- -------------- -------------- -------------- ---------------
panel 2016 2018 2020 Total
income_status
Above Average 328 ( 43.8%) 235 ( 38.3%) 247 ( 45.6%) 810 ( 42.5%)
Below Average 421 ( 56.2%) 379 ( 61.7%) 295 ( 54.4%) 1095 ( 57.5%)
Total 749 (100.0%) 614 (100.0%) 542 (100.0%) 1905 (100.0%)
--------------- ------- -------------- -------------- -------------- ---------------
# Internet status
ctable(think_married$internet_status, think_married$panel,
prop = "c",
format = "p",
useNA = "no",
headings = FALSE
)
----------------- ------- -------------- -------------- -------------- ---------------
panel 2016 2018 2020 Total
internet_status
Above Average 207 ( 29.4%) 180 ( 29.4%) 165 ( 33.4%) 552 ( 30.5%)
Below Average 496 ( 70.6%) 433 ( 70.6%) 329 ( 66.6%) 1258 ( 69.5%)
Total 703 (100.0%) 613 (100.0%) 494 (100.0%) 1810 (100.0%)
----------------- ------- -------------- -------------- -------------- ---------------